Effective interactive data visualization with pandas and pygal

Introduction

I really like pandas – the powerful data analysis framework for Python. And I really like pygal – an interactive visualization library written in and for Python.

Why not put these two libraries together for effective data visualizations?

In this blog post, I want to show you some basic use cases and integration tips for pandas and pygal.

Data

We need some data to work with; which dataset we pick doesn't really matter. Here I use a dataset that measures how much of the source code is used during program execution. It shows the lines of source code that were executed (covered) or missed during a coverage measurement in production.

As usual, we load this data with pandas first.



In [1]:

    
import pandas as pd

raw = pd.read_csv("datasets/jacoco_production_coverage_spring_petclinic.csv")
raw.head()









    Out[1]:

       PACKAGE                                      CLASS                 LINE_MISSED  LINE_COVERED
    0  org.springframework.samples.petclinic        PetclinicInitializer  0            24
    1  org.springframework.samples.petclinic.model  NamedEntity           1            4
    2  org.springframework.samples.petclinic.model  Specialty             0            1
    3  org.springframework.samples.petclinic.model  PetType               0            1
    4  org.springframework.samples.petclinic.model  Vets                  4            0

Let's create a tidier DataFrame that makes this data easier to consume later.



In [2]:

    
df = pd.DataFrame(index=raw.index)
df['class'] = raw['PACKAGE'] + "." + raw['CLASS']
df['lines'] = raw['LINE_MISSED'] + raw['LINE_COVERED']
df['coverage'] = raw['LINE_COVERED'] / df['lines']
df.head()









    Out[2]:

       class                                              lines  coverage
    0  org.springframework.samples.petclinic.Petclini...  24     1.0
    1  org.springframework.samples.petclinic.model.Na...  5      0.8
    2  org.springframework.samples.petclinic.model.Sp...  1      1.0
    3  org.springframework.samples.petclinic.model.Pe...  1      1.0
    4  org.springframework.samples.petclinic.model.Vets   4      0.0
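
One edge case worth keeping in mind: a class without any measured lines would lead to a division of zero by zero. The following sketch uses a tiny made-up frame (not the real dataset) to show that pandas yields NaN in that case, together with one possible way to handle it. Filling the gap with 0.0 is an assumption for illustration, not something the dataset requires.

```python
import pandas as pd

# Made-up rows for illustration; the last one has no measured lines at all.
raw_example = pd.DataFrame({
    "LINE_MISSED": [1, 4, 0],
    "LINE_COVERED": [4, 0, 0],
})

lines = raw_example["LINE_MISSED"] + raw_example["LINE_COVERED"]
coverage = raw_example["LINE_COVERED"] / lines  # 0 / 0 becomes NaN

# Assumption: treating "nothing to cover" as 0.0 is acceptable here.
coverage_filled = coverage.fillna(0.0)
print(coverage_filled.tolist())
```

Dropping such rows with dropna() would be an equally valid choice if empty classes should not show up in a chart at all.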

Visualization

Setup

The following cell has nothing to do with pandas and pygal per se, but it enables us to embed interactive visualizations directly into this notebook. This is pretty cool, so we use this here!



In [3]:

    
from IPython.display import display, HTML

base_html = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="https://kozea.github.io/pygal.js/2.0.x/pygal-tooltips.min.js"></script>
  </head>
  <body>
    <figure>
      {rendered_chart}
    </figure>
  </body>
</html>
"""

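Since every chart in this post is embedded in the same way, the formatting step could be wrapped in a small helper. This is only a sketch: embed_chart and the stand-in FakeChart class are hypothetical names, and the shortened template stands in for the base_html string from the cell above.

```python
# Shortened stand-in for the base_html template defined above.
template = "<figure>{rendered_chart}</figure>"


def embed_chart(chart, html_template=template):
    """Return the page markup with the chart's SVG filled in."""
    # render(is_unicode=True) is the same call the notebook cells use;
    # it returns the SVG as str instead of bytes.
    return html_template.format(rendered_chart=chart.render(is_unicode=True))


class FakeChart:
    """Hypothetical stand-in so the sketch runs without pygal installed."""

    def render(self, is_unicode=False):
        svg = "<svg></svg>"
        return svg if is_unicode else svg.encode("utf-8")


page = embed_chart(FakeChart())
print(page)
```

In the notebook, display(HTML(embed_chart(bar_chart, base_html))) would then replace the repeated formatting line.
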
Basics

The core idea is to let pandas create the data in a format that pygal's visualizations can consume easily. So let's have a look at what pygal expects as input data.

Here is a basic bar chart example (adapted from pygal's documentation). Take a look at the visualization (hint: it's interactive!).



In [4]:

    
import pygal

bar_chart = pygal.Bar(height=200)
bar_chart.title = 'Browser usage evolution (in %)'
bar_chart.x_labels = map(str, range(2002, 2013))
bar_chart.add('Firefox', [None, None,    0, 16.6,   25,   31, 36.4, 45.5, 46.3, 42.8, 37.1])
bar_chart.add('Chrome',  [None, None, None, None, None, None,    0,  3.9, 10.8, 23.8, 35.3])
bar_chart.add('IE',      [85.8, 84.6, 84.7, 74.5,   66, 58.6, 54.7, 44.8, 36.2, 26.6, 20.1])
bar_chart.add('Others',  [14.2, 15.4, 15.3,  8.9,    9, 10.4,  8.9,  5.8,  6.7,  6.8,  7.5])
display(HTML(base_html.format(rendered_chart=bar_chart.render(is_unicode=True))))

One of the important lines is this one:

bar_chart.add('Firefox', [None, None,    0, 16.6,   25,   31, 36.4, 45.5, 46.3, 42.8, 37.1])

For each bar chart category (like "Firefox" or "Chrome"), we need to call the add function and provide the data.

Let's go back to our own dataset. First, we create a category that makes some kind of sense for our use case. Let's use the name of a technical aspect of a source code file as our category. We can find this information at a specific part in the class column (at least for most cases).



In [5]:

    
df['category'] = df['class'].str.split(".").str[-2]
df.head()









    Out[5]:

       class                                              lines  coverage  category
    0  org.springframework.samples.petclinic.Petclini...  24     1.0       petclinic
    1  org.springframework.samples.petclinic.model.Na...  5      0.8       model
    2  org.springframework.samples.petclinic.model.Sp...  1      1.0       model
    3  org.springframework.samples.petclinic.model.Pe...  1      1.0       model
    4  org.springframework.samples.petclinic.model.Vets   4      0.0       model
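
The "at least for most cases" caveat can be made concrete. In this sketch, the second class name is made up and has no package, so there is no second-to-last part to pick; pandas then returns a missing value. The "(default)" label is an assumption chosen for illustration.

```python
import pandas as pd

classes = pd.Series([
    "org.springframework.samples.petclinic.model.Vets",
    "Standalone",  # made-up class name without any package
])

# Split on dots and take the second-to-last part (the innermost package).
category = classes.str.split(".").str[-2]

# Assumption: "(default)" is a reasonable label for classes without a package.
category = category.fillna("(default)")
print(category.tolist())
```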

Bar chart

OK, let's try to create a bar chart for the mean coverage of each category. This is a first step into the basic mechanics of the integration between pandas and pygal.



In [6]:

    
mean_by_category = df.groupby('category')['coverage'].mean()
mean_by_category









    Out[6]:





category
jdbc         0.000000
jpa          0.691558
model        0.739048
petclinic    1.000000
service      0.888889
util         0.135417
web          0.639809
Name: coverage, dtype: float64

We just iterate over all entries and add these to the bar chart by using a list comprehension.



In [7]:

    
bar_chart = pygal.Bar(height=200)
[bar_chart.add(x[0], x[1]) for x in mean_by_category.items()]
display(HTML(base_html.format(rendered_chart=bar_chart.render(is_unicode=True))))

So this is pretty standard and easy to do.
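
The list comprehension above is used only for its side effect; a plain loop does the same job without building a throwaway list. In this sketch, RecordingChart is a hypothetical stand-in for a pygal chart (it only remembers what was added), and the two values are copied from the output above.

```python
import pandas as pd


class RecordingChart:
    """Hypothetical stand-in for a chart; it only remembers what was added."""

    def __init__(self):
        self.series = []

    def add(self, title, values):
        self.series.append((title, values))


mean_example = pd.Series({"jdbc": 0.0, "service": 0.888889})

chart = RecordingChart()
# Series.items() yields (index label, value) pairs, one per category.
for category, mean_coverage in mean_example.items():
    chart.add(category, mean_coverage)

print(chart.series)
```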

Let's look at a slightly more sophisticated use case: showing values for all classes and coloring the classes according to the category they belong to.

For this, a bar chart doesn't make sense anymore. So let's look at another visualization type.

Treemap

A treemap turns the entries of a dataset into tiles whose size depends on their value and arranges them neatly next to each other.

New tricks

The key idea for integrating pandas with pygal is to use pandas' groupby function to get the data into a format that pygal can consume. The special trick is to put all the lines values of each category into a list.



In [8]:

    
values_by_category = df.groupby(['category'])['lines'].apply(list)
values_by_category









    Out[8]:





category
jdbc               [7, 33, 17, 9, 26, 7, 8, 43]
jpa                               [8, 11, 2, 7]
model        [5, 1, 1, 4, 12, 5, 7, 40, 21, 12]
petclinic                                  [24]
service                                    [18]
util                              [5, 3, 6, 24]
web                 [36, 10, 30, 11, 16, 10, 2]
Name: lines, dtype: object

This format is exactly what pygal needs. Let's create the treemap out of this data by using a list comprehension again.



In [9]:

    
treemap = pygal.Treemap(height=200)
[treemap.add(x[0], x[1]) for x in values_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
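
The grouping step can be tried in isolation. The sketch below uses a three-row made-up frame to show what groupby followed by apply(list) produces: one list of values per category.

```python
import pandas as pd

# Made-up rows for illustration.
example = pd.DataFrame({
    "category": ["model", "web", "model"],
    "lines": [5, 36, 1],
})

# One list of line counts per category, in the original row order.
values = example.groupby("category")["lines"].apply(list)
print(values.to_dict())
```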

Adding labels

You might have noticed that the labels on mouse-over actions don't show the actual class name but rather the name of the category. Instead of passing a list of values, we need to differentiate between the actual value and the corresponding label for each value. We can do this by passing an appropriate dictionary.

chart.add('category', [{'value' : 1, 'label': 'one'}, {'value': 2, 'label': 'two'}])

Let's fix this with another trick: we can iterate over the necessary data during the grouping of the values. For this, we combine the columns that we need with the zip function and build the data dictionaries within the apply call.



In [10]:

    
class_values_by_category = df.groupby(['category']).apply(
    lambda x : [{"value" : l, "label" : c } for l, c in zip(x['lines'], x['class'])])
class_values_by_category









    Out[10]:





category
jdbc         [{'value': 7, 'label': 'org.springframework.sa...
jpa          [{'value': 8, 'label': 'org.springframework.sa...
model        [{'value': 5, 'label': 'org.springframework.sa...
petclinic    [{'value': 24, 'label': 'org.springframework.s...
service      [{'value': 18, 'label': 'org.springframework.s...
util         [{'value': 5, 'label': 'org.springframework.sa...
web          [{'value': 36, 'label': 'org.springframework.s...
dtype: object

If we generate the treemap once again, you can spot the difference in the visualization by hovering over the tiles with your pointing device.



In [11]:

    
treemap = pygal.Treemap(height=200)
[treemap.add(x[0], x[1]) for x in class_values_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
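
If the apply call with a lambda feels too dense, the same value/label structure could also be built with a dictionary comprehension over the groups. This is an alternative sketch rather than the approach used above; the frame and its shortened class names are made up for illustration.

```python
import pandas as pd

# Made-up rows with shortened, hypothetical class names.
example = pd.DataFrame({
    "category": ["model", "web", "model"],
    "class": ["model.Vets", "web.VetController", "model.PetType"],
    "lines": [4, 36, 1],
})

# Iterating over a groupby object yields (group key, sub-frame) pairs.
labelled = {
    category: [
        {"value": int(lines), "label": name}
        for lines, name in zip(group["lines"], group["class"])
    ]
    for category, group in example.groupby("category")
}
print(labelled)
```

Each key/value pair of the resulting dictionary could then be passed to the chart's add function just like the entries of the grouped Series.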

Adding color

In the final step, I want to show you how you can colorize the tiles as needed. In our case, the coverage column is a perfect candidate for this, because it shows the ratio of executed code lines. A value near 1 means that almost all lines of a class were executed. A value near 0 means that almost none of them ran.

Let's see if we can visualize this in the treemap, too. For this, we need two things:

an indicator that shows how much of a class is covered (we have this information in the coverage column)
a spectrum of colors that we want to use to show how strong the indicator is for each entry (we could use the metaphor of hot and cold for values near 1 and 0 respectively)

There are many ways to do it, but the most basic way is to assign every indicator value a corresponding color. For this, we'll use the blue-to-red coolwarm colormap from matplotlib and pick the colors accordingly.



In [12]:

    
from matplotlib.cm import coolwarm
from matplotlib.colors import rgb2hex

df['color'] = df['coverage'].apply(lambda x : rgb2hex(coolwarm(x)))
df.head()









    Out[12]:

       class                                              lines  coverage  category   color
    0  org.springframework.samples.petclinic.Petclini...  24     1.0       petclinic  #b40426
    1  org.springframework.samples.petclinic.model.Na...  5      0.8       model      #ee8468
    2  org.springframework.samples.petclinic.model.Sp...  1      1.0       model      #b40426
    3  org.springframework.samples.petclinic.model.Pe...  1      1.0       model      #b40426
    4  org.springframework.samples.petclinic.model.Vets   4      0.0       model      #3b4cc0
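
To make the mapping from a value to a color less of a black box, here is a simplified sketch that blends linearly between the two end colors shown above. Note that this is not how coolwarm works internally (the real colormap passes through a light gray in the middle); blend_hex is a hypothetical helper for illustration only.

```python
def blend_hex(ratio, cold=(59, 76, 192), hot=(180, 4, 38)):
    """Blend linearly from the cold color (ratio 0) to the hot color (ratio 1).

    The defaults are the RGB values of #3b4cc0 and #b40426, the two end
    colors that appear in the table above.
    """
    ratio = min(max(ratio, 0.0), 1.0)  # clamp to the range [0, 1]
    channels = [round(c + (h - c) * ratio) for c, h in zip(cold, hot)]
    return "#{:02x}{:02x}{:02x}".format(*channels)


print(blend_hex(0.0))
print(blend_hex(1.0))
```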



In [13]:

    
class_ratios_by_category = df.groupby(['category']).apply(
    lambda x : [
        {"value" : y,
         "label" : z,
         "color" : c} for y, z, c in zip(
            x['lines'],
            x['class'],
            x['color'])])
class_ratios_by_category









    Out[13]:





category
jdbc         [{'value': 7, 'label': 'org.springframework.sa...
jpa          [{'value': 8, 'label': 'org.springframework.sa...
model        [{'value': 5, 'label': 'org.springframework.sa...
petclinic    [{'value': 24, 'label': 'org.springframework.s...
service      [{'value': 18, 'label': 'org.springframework.s...
util         [{'value': 5, 'label': 'org.springframework.sa...
web          [{'value': 36, 'label': 'org.springframework.s...
dtype: object

Let's plot this treemap. We disable the legend because it no longer makes sense (the colors of the legend don't match the colors in the treemap anymore).



In [14]:

    
treemap = pygal.Treemap(height=200, show_legend=False)
[treemap.add(x[0], x[1]) for x in class_ratios_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))

Hacking the system

One problem remains, though: the value in the lower left corner of the tooltip is the lines value. If we want to display another value there (e.g. the coverage value), we need to hack the system a little bit by introducing a value formatter. This formatter needs a formatting function that we can happily provide (but surely not in the way the library designers originally intended ;-) ).



In [15]:

    
class_ratios_hack_by_category = df.groupby(['category']).apply(
    lambda x : [
        {"value" : y,
         "label" : z,
         "color" : c,
         # bind f as a default argument so that each tile keeps its own coverage value
         "formatter" : lambda x, f=f : "{0:.0%}".format(f)} for y, z, c, f in zip(
            x['lines'],
            x['class'],
            x['color'],
            x['coverage'])])
class_ratios_hack_by_category









    Out[15]:





category
jdbc         [{'value': 7, 'label': 'org.springframework.sa...
jpa          [{'value': 8, 'label': 'org.springframework.sa...
model        [{'value': 5, 'label': 'org.springframework.sa...
petclinic    [{'value': 24, 'label': 'org.springframework.s...
service      [{'value': 18, 'label': 'org.springframework.s...
util         [{'value': 5, 'label': 'org.springframework.sa...
web          [{'value': 36, 'label': 'org.springframework.s...
dtype: object



In [16]:

    
treemap = pygal.Treemap(height=200, show_legend=False, colors=["#ffffff"])
[treemap.add(x[0], x[1]) for x in class_ratios_hack_by_category.items()]
display(HTML(base_html.format(rendered_chart=treemap.render(is_unicode=True))))
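
A detail to watch when formatters are created inside a comprehension: a lambda looks up free variables such as f only when it is called, so every formatter in a group would otherwise report the coverage of the last class in that group. Binding the value as a default argument (f=f) is a common way around this. The sketch below is plain Python with made-up coverage values and shows both variants side by side.

```python
coverages = [0.0, 0.8, 1.0]  # made-up coverage values

# Late binding: all three lambdas share the same variable f,
# which holds the last value once the comprehension has finished.
late = [lambda _value: "{0:.0%}".format(f) for f in coverages]

# Early binding: the default argument stores the current value of f
# in each lambda at the moment it is created.
bound = [lambda _value, f=f: "{0:.0%}".format(f) for f in coverages]

late_labels = [formatter(None) for formatter in late]
bound_labels = [formatter(None) for formatter in bound]
print(late_labels)
print(bound_labels)
```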

Gauge

There are many other visualization types that you can use with these tricks. Let's take a look at the dataset from the beginning.



In [17]:

    
mean_by_category









    Out[17]:





category
jdbc         0.000000
jpa          0.691558
model        0.739048
petclinic    1.000000
service      0.888889
util         0.135417
web          0.639809
Name: coverage, dtype: float64

We can visualize this, for example, as a gauge chart.



In [18]:

    
gauge = pygal.SolidGauge(inner_radius=0.70)
[gauge.add(x[0], [{"value" : x[1] * 100}] ) for x in mean_by_category.items()]
display(HTML(base_html.format(rendered_chart=gauge.render(is_unicode=True))))

Or in another variant of it...



In [19]:

    
gauge = pygal.Gauge(human_readable=True)
[gauge.add(x[0], [{"value" : x[1] * 100}] ) for x in mean_by_category.items()]
display(HTML(base_html.format(rendered_chart=gauge.render(is_unicode=True))))

OK, STOP! Enough for today!

Conclusion

All right, that's it for this blog post! I hope you have seen that, once you know a few tricks, you can easily integrate pandas with pygal!

I find this combination a nice tradeoff between complexity and interactivity. Let me know if I can simplify anything or explain one or two things in more depth.

Maybe next time, we can take a look at some tricks regarding D3, can't we?
